Collaborating Authors

 Casper


PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

Anderson, Carolyn Jane, Biswas, Joydeep, Boruch-Gruszecki, Aleksander, Cassano, Federico, Feldman, Molly Q, Guha, Arjun, Lucchetti, Francesca, Wu, Zixuan

arXiv.org Artificial Intelligence

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output and, in rare cases, it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.


MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection

Veyseh, Amir Pouran Ben, Van Nguyen, Minh, Dernoncourt, Franck, Nguyen, Thien Huu

arXiv.org Artificial Intelligence

Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. For non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION, which together call for more research effort in this area.


AI can figure out a place's politics by analyzing cars on Google Street View

Popular Science

Google Street View images are filled with cars. That is a simple and pedestrian truth, and one which artificial intelligence researchers have taken advantage of to do something surprising. By analyzing car type, they were able to make predictions about the demographic information of the people in the cities they studied. For example, the team, largely from Stanford University, analyzed whether they saw more pickup trucks or sedans in a given city. With a greater number of pickup trucks, the urban area had an 82 percent chance of voting Republican, and with more sedans, there was an 88 percent chance it voted Democrat. Artificial intelligence systems shine when crunching staggeringly large amounts of data and then making predictions about what they see in it.


Janice: Excited for eclipse

FOX News

I was 8 years old and remember being both terrified and intrigued about something that was being talked about everywhere. This wasn't a storyline out of a science fiction movie or novel; this was real, and happening here on Earth. Millions of people were going to witness something that maybe happens a couple of times in our lifetime: a total solar eclipse. Our teachers were planning lessons about this incredible celestial event. Chalkboard diagrams, planetary mobiles and handmade viewing devices were being created out of shoe boxes.


Fine-Grained Car Detection for Visual Census Estimation

Gebru, Timnit (Stanford University) | Krause, Jonathan (Stanford University) | Wang, Yilun (Stanford University) | Chen, Duyun (Stanford University) | Deng, Jia (University of Michigan) | Fei-Fei, Li (Stanford University)

AAAI Conferences

Targeted socio-economic policies require an accurate understanding of a country’s demographic makeup. To that end, the United States spends more than 1 billion dollars a year gathering census data such as race, gender, education, occupation and unemployment rates. Compared to the traditional method of collecting surveys across many years, which is costly and labor intensive, data-driven, machine learning-driven approaches are cheaper and faster, with the potential ability to detect trends in close to real time. In this work, we leverage the ubiquity of Google Street View images and develop a computer vision pipeline to predict income, per capita carbon emission, crime rates and other city attributes from a single source of publicly available visual data. We first detect cars in 50 million images across 200 of the largest US cities and train a model to predict demographic attributes using the detected cars. To facilitate our work, we have collected the largest and most challenging fine-grained dataset reported to date, consisting of over 2600 classes of cars, composed of images from Google Street View and other web sources, and classified by car experts to account for even the most subtle of visual differences. We use this data to construct the largest scale fine-grained detection system reported to date. Our prediction results correlate well with ground truth income data (r=0.82), Massachusetts department of vehicle registration, and sources investigating crime rates, income segregation, per capita carbon emission, and other market research. Finally, we learn interesting relationships between cars and neighborhoods, allowing us to perform the first large scale sociological analysis of cities using computer vision techniques.